ABSTRACT

Motivation: Analogous to biological sequence comparison, compa- 
ring cellular networks is an important problem that could provide 
insight into biological understanding and therapeutics. For techni- 
cal reasons, comparing large networks is computationally infeasible, 
and thus heuristics, such as the degree distribution, clustering coeffi- 
cient, diameter, and relative graphlet frequency distribution have been 
sought. It is easy to demonstrate that two networks are different by 
simply showing a short list of properties in which they differ. It is much 
harder to show that two networks are similar, as it requires demon- 
strating their similarity in all of their exponentially many properties. 
Clearly, it is computationally prohibitive to analyze all network proper- 
ties, but the larger the number of constraints we impose in determining 
network similarity, the more likely it is that the networks will truly be 
similar. 

Results: We introduce a new systematic measure of a network's local 
structure that imposes a large number of similarity constraints on 
networks being compared. In particular, we generalize the degree 
distribution, which measures the number of nodes "touching" k 
edges, into distributions measuring the number of nodes "touching" k 
graphlets, where graphlets are small connected non-isomorphic sub- 
graphs of a large network. Our new measure of network local structure 
consists of 73 graphlet degree distributions of graphlets with 2, 3, 4, 
and 5 nodes, but it is easily extendible to a greater number of constraints
(i.e., graphlets), if necessary, and the extensions are limited
only by the available CPU. Furthermore, we show a way to combine
the 73 graphlet degree distributions into a network "agreement"
measure which is a number between 0 and 1, where 1 means that networks
have identical distributions and 0 means that they are far apart.
Based on this new network agreement measure, we show that almost 
all of the fourteen eukaryotic PPI networks, including human, resul- 
ting from various high-throughput experimental techniques, as well 
as from curated databases, are better modeled by geometric random 
graphs than by Erdos-Renyi, random scale-free, or Barabasi-Albert 
scale-free networks. 

Availability: Software executables are available upon request. 
Contact: natasha@ics.uci.edu 



1 INTRODUCTION 

Understanding cellular networks is a major problem in current 
computational biology. These networks are commonly modeled by 
graphs (also called networks) with nodes representing biomolecules
such as genes, proteins, metabolites etc., and edges representing
physical, chemical, or functional interactions between the biomolecules.
The ability to compare such networks would be very useful.
For example, comparing a diseased cellular network to a healthy 
one may aid in finding a cure for the disease, and comparing cellu- 
lar networks of different species could enable evolutionary insights. 
A full description of the differences between two large networks 
is infeasible because it requires solving the subgraph isomorphism 
problem, which is an NP-complete problem. Therefore, analogous 
to the BLAST heuristic (1) for biological sequence comparison, we 
need to design a heuristic tool for the full-scale comparison of large 
cellular networks (5). The current network comparisons consist of 
heuristics falling into two major classes: 1) global heuristics, such 
as counting the number of connections between various parts of the 
network (the "degree distribution"), computing the average density 
of node neighborhoods (the "clustering coefficients"), or the average 
length of shortest paths between all pairs of nodes (the "diameter"); 
and 2) local heuristics that measure relative distance between con- 
centrations of small subgraphs (called graphlets) in two networks 
(27). 

Since cellular networks are incompletely explored, global sta- 
tistics on such incomplete data may be substantially biased, or 
even misleading with respect to the (currently unknown) full net- 
work. Conversely, certain neighborhoods of these networks are 
well-studied, and so locally based statistics applied to the well- 
studied areas are more appropriate. A good analogy would be to 
imagine that MapQuest knew details of the streets of New York City 
and Los Angeles, but had little knowledge of highways spanning 
the country. Then, it could provide good driving directions inside 
New York or L. A., but not between the two. Similarly, we have 
detailed knowledge of certain local areas of biological networks, but 
data outside these well-studied areas is currently incomplete, and so 
global statistics are likely to provide misleading information about 
the biological network as a whole, while local statistics are likely to 
be more valid and meaningful. 

Due to the noise and incompleteness of cellular network data, 
local approaches to analyzing and comparing cellular network struc- 
ture that involve searches for small subgraphs have been successful 
in analyzing, modeling, and discovering functional modules in cel- 
lular networks (22; 30; 21; 27). Note that it is easy to show that two 
networks are different simply by finding any property in which they 
differ. However, it is much harder to show that they are similar, since 
it involves showing that two networks are similar with respect to all 
of their properties. Current common approaches to show network 
similarity are based on listing several common properties, such as 
the degree distribution, clustering, diameter, or relative graphlet frequency
distribution. The larger the number of common properties,
the more likely it is that the two networks are similar. But any short 
list of properties can easily be mimicked by two very large and dif- 
ferent networks. For example, it is easy to construct networks with 
exactly the same degree distribution whose structure and function 
differ substantially (27; 18; 8). 

In this paper, we design a new local heuristic for measuring 
network structure that is a direct generalization of the degree dis- 
tribution. In fact, the degree distribution is the first in the spectrum 
of 73 graphlet degree distributions that are components of this new 
measure of network structure. Thus, in our new network similarity 
measure, we impose 73 highly structured constraints in which net- 
works must show similarity to be considered similar; this is a much 
larger number of constraints than provided by any of the previous 
approaches and therefore it increases the chances that two networks 
are truly similar if they are similar with respect to this new measure. 
Moreover, the measure can be easily extended to a greater number 
of constraints simply by adding more graphlets. The extensions are 
limited only by the available CPU. 

Based on this new measure of structural similarity between two 
networks, we show that the geometric random graph model shows 
exceptionally high agreement with twelve out of fourteen different 
eukaryotic protein-protein interaction (PPI) networks. Furthermore, 
we show that such high structural agreements between PPI and geo- 
metric random graphs are unlikely to be beaten by another random 
graph model, at least under this measure. 

1.1 Background 

Large amounts of cellular network data for a number of organisms 
have recently become available through high-throughput methods 
(13; 35; 11; 19; 32; 29). Statistical and theoretical properties of 
these networks have been extensively studied (15; 20; 30; 22; 36; 
41; 27; 34) due to their important biological implications (14; 17). 

Comparing large cellular networks is computationally intensive. 
Exhaustively computing the differences between networks is com- 
putationally infeasible, and thus efficient heuristic algorithms have 
been sought (16; 28). Although global properties of large networks 
are easy to compute, they are inappropriate for use on incomplete net- 
works because they can at best describe the structure produced by the 
biological sampling techniques used to obtain the partial networks 
(12). Therefore, bottom-up or local heuristic approaches for study- 
ing network structure have been proposed (22; 30; 27). Analogous 
to sequence motifs, network motifs have been defined as subgraphs 
that occur in a network at frequencies much higher than expected 
at random (22; 30; 21). Network motifs have been generalized to 
topological motifs as recurrent "similar" network sub-patterns (5). 
However, the approaches based on network motifs ignore infrequent 
subnetworks and subnetworks with "average" frequencies, and thus 
are not sufficient for full-scale network comparison. Therefore, small 
connected non-isomorphic induced subgraphs of a large network, 
called graphlets, have been introduced to design a new measure 
of local structural similarity between two networks based on their 
relative frequency distributions (27). 

The earliest attempts to model real-world networks include Erdos- 
Renyi random graphs (henceforth denoted by "ER") in which edges 
between pairs of nodes are distributed uniformly at random with 
the same probability p (9; 10). This model poorly describes several
properties of real-world networks, including the degree distribution
and clustering coefficients, and therefore it has been refined into 
generalized random graphs in which the edges are randomly cho- 
sen as in Erdos-Renyi random graphs, but the degree distribution 
is constrained to match the degree distribution of the real network 
(henceforth we denote these networks by "ER-DD"). Matching other
global properties of the real-world networks to the model networks,
such as clustering coefficients, led to further improvements in modeling
real-world networks including small-world (38; 23; 24) and
scale-free (31; 3) network models (henceforth, we denote by "SF" 
scale-free Barabasi-Albert networks). Many cellular networks have 
been described as scale-free (4). However, this issue has been hea- 
vily debated (7; 33; 12; 34). Recently, based on the local relative 
graphlet frequency distribution measure, a geometric random graph 
model (25) has been proposed for high-confidence PPI networks (27).



2 APPROACH 

In section 2.1, we describe the fourteen PPI networks and the four 
network models that we analyzed. Then we describe how we gene- 
ralize the degree distribution to our spectrum of graphlet degree 
distributions (section 2.2); note that the degree distribution is the 
first distribution in this spectrum, since it corresponds to the only 
graphlet with two nodes. Finally, we construct a new measure of 
similarity between two networks based on graphlet degree distribu- 
tions (section 2.3). We describe the results of applying this measure 
to the fourteen PPI networks in section 3. 

2.1 PPI and Model Networks 

We analyzed PPI networks of the eukaryotic organisms yeast S. cerevisiae,
fruitfly D. melanogaster, nematode worm C. elegans, and
human. Several different data sets are available for yeast and human,
so we analyzed five yeast PPI networks obtained from three different
high-throughput studies (35; 13; 37) and five human PPI networks
obtained from two recent high-throughput studies (32; 29) and
three curated databases (2; 26; 42). We denote by "YHC" the high-confidence
yeast PPI network as described by (37), by "Y11K" the
yeast PPI network defined by the top 11,000 interactions in the (37)
classification, by "YIC" the (13) "core" yeast PPI network, by "YU"
the (35) yeast PPI network, and by "YICU" the union of the (13) core and
(35) yeast PPI networks (we unioned them as did (12) to increase coverage).
"FE" and "FH" denote the fruitfly D. melanogaster entire and
high-confidence PPI networks published by (11). Similarly, "WE" and
"WC" denote the worm C. elegans entire and "core" PPI networks
published by (19). Finally, "HS", "HG", "HB", "HH", and "HM"
stand for human PPI networks by (32), by (29), from BIND (2), HPRD
(26), and MINT (42), respectively (BIND, HPRD, and MINT data
were downloaded from OPHID (6) on February 10, 2006). Note
that these PPI networks come from a wide array of experimental
techniques; for example, YHC and Y11K mainly come from
tandem affinity purifications (TAP) and high-throughput MS/MS protein
complex identification (HMS-PCI), while YIC, YU, YICU, FE,
FH, WE, WC, HS, and HG are yeast two-hybrid (Y2H), and HB, HH,
and HM are a result of human curation (BIND, HPRD, and MINT).

The four network models that we compared against the above four- 
teen PPI networks are ER, ER-DD, SF, and 3-dimensional geometric 
random graphs (henceforth denoted by "GEO-3D"). Model networks 
corresponding to a PPI network have the same number of nodes and
a number of edges within 1% of the PPI network's (details of the
construction of model networks are presented in (27)). For each of the
fourteen PPI networks, we constructed and analyzed 25 networks
belonging to each of these four network models. Thus, we analyzed
a total of 14 + 14 · 4 · 25 = 1,414 networks. We compared the
agreement of each of the fourteen PPI networks with each of the
corresponding 4 · 25 = 100 model networks described above (our
new agreement measure is described in section 2.3). The results of
this analysis are presented in section 3.
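
To make the construction concrete, the following is a minimal sketch of how ER and GEO-3D model networks matching a given node and edge count could be generated. It is an illustration under stated assumptions, not the procedure of (27): the function names `er_graph` and `geo3d_graph` are hypothetical, the ER variant draws exactly m edges uniformly at random, and the geometric variant places points uniformly in the unit cube and joins the m closest pairs (equivalent to choosing the distance threshold that yields m edges). The toy sizes are arbitrary.

```python
import itertools
import random


def er_graph(n, m, seed=None):
    """Erdos-Renyi style graph: m distinct edges chosen uniformly at random."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(n), 2))
    return set(rng.sample(all_pairs, m))


def geo3d_graph(n, m, seed=None):
    """3-D geometric random graph with exactly m edges.

    Points are placed uniformly in the unit cube; the m closest pairs are
    joined, which amounts to picking the distance threshold that gives m edges.
    """
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]

    def sq_dist(pair):
        u, v = pair
        return sum((a - b) ** 2 for a, b in zip(pts[u], pts[v]))

    pairs = sorted(itertools.combinations(range(n), 2), key=sq_dist)
    return set(pairs[:m])


# Toy sizes (hypothetical; the real networks have thousands of nodes).
n_nodes, n_edges = 60, 180
er = er_graph(n_nodes, n_edges, seed=1)
geo = geo3d_graph(n_nodes, n_edges, seed=1)

# Bookkeeping from the text: 14 PPI networks plus 25 model networks for
# each of the 4 model types per PPI network.
total_networks = 14 + 14 * 4 * 25
```

Both generators enumerate all node pairs, which is quadratic in the number of nodes; that is acceptable for a sketch but would need a spatial index or sampling scheme at the scale of the networks analyzed here.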

2.2 Graphlet Degree Distribution (GDD) 

We generalize the notion of the degree distribution as follows. The
degree distribution measures, for each value of k, the number of nodes
of degree k. In other words, for each value of k, it gives the number
of nodes "touching" k edges. Note that an edge is the only graphlet
with two nodes; henceforth, we call this graphlet G0 (illustrated
in Figure 1). Thus, the degree distribution measures the following:
how many nodes "touch" one G0, how many nodes "touch" two
G0s, ..., how many nodes "touch" k G0s. Note that there is nothing
special about graphlet G0 and that there is no reason not to apply the
same measurement to other graphlets. Thus, in addition to applying
this measurement to an edge, i.e., graphlet G0, as in the degree
distribution, we apply it to the twenty-nine graphlets G1, G2, ..., G29
presented in Figure 1 as well.
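
As a minimal sketch of the starting point of this generalization, the ordinary degree distribution of a small graph can be computed as follows. The edge-list representation, the function name `degree_distribution`, and the toy graph are illustrative assumptions, not part of the original method.

```python
from collections import Counter


def degree_distribution(edges):
    """Map k -> number of nodes touching exactly k edges (k copies of G0)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(Counter(degree.values()))


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
dist = degree_distribution(toy_edges)
# Nodes 0 and 1 have degree 2, node 2 has degree 3, node 3 has degree 1.
```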



Fig. 1. Automorphism orbits 0, 1, 2, ..., 72 for the thirty 2-, 3-, 4-, and
5-node graphlets G0, G1, ..., G29. In a graphlet Gi, i ∈ {0, 1, ..., 29}, nodes
belonging to the same orbit are of the same shade.
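
The orbit partition of a small graphlet can be checked by brute force: enumerate all node permutations, keep those that map the edge set onto itself (the automorphisms), and group together nodes that are mapped onto one another. The sketch below is illustrative only; the function name `automorphism_orbits` and the node labelings of the example graphlets are assumptions, and the approach is practical only for graphs with a handful of nodes.

```python
from itertools import permutations


def automorphism_orbits(n, edges):
    """Return the automorphism orbits of a graph on nodes 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    orbit_of = list(range(n))  # union-find parent array

    def find(x):
        while orbit_of[x] != x:
            x = orbit_of[x]
        return x

    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:  # perm preserves edges, so it is an automorphism
            for x in range(n):
                rx, ry = find(x), find(perm[x])
                if rx != ry:
                    orbit_of[ry] = rx  # x and perm[x] share an orbit
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# 3-node path (end nodes 0 and 2, middle node 1) and a triangle.
path3_orbits = automorphism_orbits(3, [(0, 1), (1, 2)])
triangle_orbits = automorphism_orbits(3, [(0, 1), (1, 2), (0, 2)])
# Triangle 0-1-2 with a tail node 3 attached to node 2.
tailed_triangle_orbits = automorphism_orbits(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

In this sketch the 3-node path splits into an end-node orbit and a middle-node orbit, the triangle has a single orbit, and the tailed triangle has three.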



When we apply this measurement to graphlets G0, ..., G29, we
need to take care of certain topological issues that we first illustrate in
the following example and then define formally. For graphlet G1, we
ask how many nodes touch a G1; however, note that it is topologically
relevant to distinguish between nodes touching a G1 at an end or at
the middle node. This is due to the following mathematical property
of G1: a G1 admits an automorphism (defined below) that maps its
end nodes to each other and the middle node to itself. To understand
this phenomenon, we need to recall the following standard mathematical
definitions. An isomorphism g from graph X to graph Y is a
bijection of nodes of X to nodes of Y such that xy is an edge of X if
and only if g(x)g(y) is an edge of Y; an automorphism is an isomorphism
from a graph to itself. The automorphisms of a graph X form a
group, called the automorphism group of X, and commonly denoted
by Aut(X). If x is a node of graph X, then the automorphism orbit
of x is Orb(x) = {y ∈ V(X) | y = g(x) for some g ∈ Aut(X)},
where V(X) is the set of nodes of graph X. Thus, end nodes of a
G1 belong to one automorphism orbit, while the mid-node of a G1
belongs to another. Note that graphlet G0 (i.e., an edge) has only
one automorphism orbit, as does graphlet G2; graphlet G3 has two
automorphism orbits, as does graphlet G4; graphlet G5 has one
automorphism orbit, graphlet G6 has three automorphism orbits, etc. (see
Figure 1). In Figure 1, we illustrate the partition of nodes of graphlets
G0, G1, ..., G29 into automorphism orbits (or just orbits for
brevity); henceforth, we number the 73 different orbits of graphlets
G0, G1, ..., G29 from 0 to 72, as illustrated in Figure 1. Analogous to
the degree distribution, for each of these 73 automorphism orbits, we
count the number of nodes touching a particular graphlet at a node
belonging to a particular orbit. For example, we count how many
nodes touch one triangle (i.e., graphlet G2), how many nodes touch
two triangles, how many nodes touch three triangles, etc. We need
to separate nodes touching a G1 at an end-node from those touching
it at a mid-node; thus we count how many nodes touch one G1 at
an end-node (i.e., at orbit 1), how many nodes touch two G1s at an
end-node, how many nodes touch three G1s at an end-node, etc., and
also how many nodes touch one G1 at a mid-node (i.e., at orbit 2),
how many nodes touch two G1s at a mid-node, how many nodes
touch three G1s at a mid-node, etc. In this way, we obtain 73 distributions
analogous to the degree distribution (actually, the degree
distribution is the distribution for the 0th orbit, i.e., for graphlet G0).
Thus, the degree distribution, which has been considered to be a global
network property, is one in the spectrum of 73 "graphlet degree
distributions (GDDs)" measuring local structural properties of a network.
Note that GDD is measuring local structure, since it is based
on small local network neighborhoods. The distributions are unlikely
to be statistically independent of each other, although we have not
yet worked out the details of the inter-dependence.
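
For the 2- and 3-node graphlets, the counts described above can be sketched directly from adjacency sets. The code below is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, only orbits 0 to 3 are covered (with the triangle taken to be orbit 3, following the numbering order), and graphlets are counted as induced subgraphs, so a 3-node path is counted only when its end nodes are not adjacent.

```python
from collections import Counter
from itertools import combinations


def orbit_counts(edges):
    """Per-node counts for orbits 0 (edge), 1 (path end), 2 (path middle),
    and 3 (triangle), using induced 3-node subgraphs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {node: [0, 0, 0, 0] for node in adj}
    for node, nbrs in adj.items():
        counts[node][0] = len(nbrs)
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                counts[node][3] += 1  # node, a, b form a triangle
            else:
                counts[node][2] += 1  # node is the middle of path a-node-b
                counts[a][1] += 1     # a and b are the end nodes of that path
                counts[b][1] += 1
    return counts


def graphlet_degree_distribution(counts, orbit):
    """Map k -> number of nodes touching the given orbit exactly k times (k >= 1)."""
    values = (c[orbit] for c in counts.values())
    return dict(Counter(k for k in values if k > 0))


# Toy graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
counts = orbit_counts([(0, 1), (1, 2), (0, 2), (2, 3)])
gdd_orbit1 = graphlet_degree_distribution(counts, 1)
```

In the toy graph, node 2 is the middle of two induced paths (0-2-3 and 1-2-3) and node 3 is an end node of both, so the orbit 1 distribution has two nodes with count 1 and one node with count 2.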

2.3 Network "GDD Agreement" 

There are many ways to "reduce" the large quantity of numbers repre- 
senting 73 sample distributions. In this section, we describe one way; 
there may be better ways, and certainly finding better ways to reduce 
this data is an obvious future direction. Some of the details may seem 
obscure at first; we justify them at the end of this section. 

We start by measuring the 73 graphlet degree distributions (GDDs)
for each network that we wish to compare. Let G be a network (i.e.,
a graph). For a particular automorphism orbit j (refer to Figure 1),
let d_G^j(k) be the sample distribution of the number of nodes in G
touching the appropriate graphlet (for automorphism orbit j) k times.
That is, d_G^j represents the j-th graphlet degree distribution (GDD).
We scale d_G^j(k) as

$$S_G^j(k) = \frac{d_G^j(k)}{k} \qquad (1)$$

to decrease the contribution of larger degrees in a GDD (for reasons
we describe later that are illustrated in Figure 2), and then normalize
the distribution with respect to its total area (in practice the upper
limit of the sum is finite due to the finite sample size),

$$T_G^j = \sum_{k=1}^{\infty} S_G^j(k), \qquad (2)$$

giving the "normalized distribution"

$$N_G^j(k) = \frac{S_G^j(k)}{T_G^j}. \qquad (3)$$

In words, N_G^j(k) is the fraction of the total area under the curve, over
the entire GDD, devoted to degree k. Finally, for two networks G
and H and a particular orbit j, we define the "distance" D^j(G, H)
between their normalized j-th distributions as

$$D^j(G, H) = \left( \sum_{k=1}^{\infty} \left[ N_G^j(k) - N_H^j(k) \right]^2 \right)^{1/2}, \qquad (4)$$

where again in practice the upper limit of the sum is finite due to
the finite sample. The distance is between 0 and 1, where 0 means
that G and H have identical j-th GDDs, and 1 means that their j-th
GDDs are far away. Next, we reverse D^j(G, H) to obtain the j-th
GDD agreement:

$$A^j(G, H) = 1 - D^j(G, H), \qquad (5)$$

for j ∈ {0, 1, ..., 72}. Finally, the agreement between two networks
G and H is either the arithmetic (equation 6) or geometric (equation
7) mean of A^j(G, H) over all j, i.e.,

$$A_{arith}(G, H) = \frac{1}{73} \sum_{j=0}^{72} A^j(G, H), \qquad (6)$$

and

$$A_{geo}(G, H) = \left( \prod_{j=0}^{72} A^j(G, H) \right)^{1/73}. \qquad (7)$$
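
Equations (1) to (7) translate into a few lines of code. The sketch below is a hedged illustration, not the authors' software: it assumes each GDD is given as a dictionary mapping k ≥ 1 to the number of nodes touching the orbit k times, the function names are hypothetical, an orbit that never occurs in a network is treated as an all-zero distribution, and the toy example uses two orbits rather than all 73.

```python
import math


def normalize(gdd):
    """Equations (1)-(3): scale counts by 1/k, then divide by the total area."""
    scaled = {k: d / k for k, d in gdd.items() if k >= 1}
    total = sum(scaled.values())
    if total == 0:
        return {}  # assumption: the orbit is absent from the network
    return {k: s / total for k, s in scaled.items()}


def orbit_agreement(gdd_g, gdd_h):
    """Equations (4)-(5): one minus the Euclidean distance."""
    n_g, n_h = normalize(gdd_g), normalize(gdd_h)
    ks = set(n_g) | set(n_h)
    dist = math.sqrt(sum((n_g.get(k, 0.0) - n_h.get(k, 0.0)) ** 2 for k in ks))
    return 1.0 - dist


def network_agreement(gdds_g, gdds_h):
    """Equations (6)-(7): arithmetic and geometric means over the orbits."""
    a = [orbit_agreement(g, h) for g, h in zip(gdds_g, gdds_h)]
    arith = sum(a) / len(a)
    # The geometric mean is only meaningful if every agreement is non-negative.
    geo = math.prod(a) ** (1.0 / len(a))
    return arith, geo


# Toy GDDs for two orbits of two hypothetical networks G and H.
gdds_g = [{1: 2, 2: 2}, {1: 4}]
gdds_h = [{1: 2, 2: 2}, {1: 2, 2: 4}]
arith, geo = network_agreement(gdds_g, gdds_h)
```

In the toy data the first orbit has identical distributions (agreement 1), while for the second orbit the normalized distributions are (1, 0) and (0.5, 0.5), giving a distance of sqrt(0.5). The geometric mean penalizes the single poorly matching orbit more strongly than the arithmetic mean does.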

Now we give the rationale for designing the agreement measure 
in this way. There are many different ways to design a measure of 
agreement between two distributions. They are all heuristic, and thus 
one needs to examine the data to design the agreement measure that 
works best for a particular application. The justification of our choice 
of the graphlet degree distribution agreement measure can be illu- 
strated by an example of two GDDs for the yeast high-confidence 
PPI network (37) and the corresponding 3-dimensional geometric 
random networks presented in Figure 2. This Figure gives an illustration
of the GDDs of orbit 11 of the PPI and average GDD of orbit
11 in 25 model networks (panel A) being "closer" than the GDDs
of orbit 16 (panel B); this is accurately reflected by our agreement
measure which gives an agreement of 0.89 for orbit 11 GDDs and of
0.51 for orbit 16. However, note that the sample distributions extend
in the x axis out to degrees of 10^4 or even 10^5; we believe that most of
the "information" in the distribution is contained in the lower degrees
and that the information in the extreme high degrees is noise due to
bio-technical false positives caused by auto-activators or sticky proteins
(12). Without scaling by 1/k as in equation (1), both
the area under the curve (2) and the distance (4) would be dominated
by the counts for large k. This explains the scaling in equation (1).
The "normalization", equation (3), is performed in order to force
both distributions to have a total area under the curve of 1 before
they are compared. We can now compute, for each value of k, the
"distance" between two distributions at that value of k. Formally k
is unbounded but in practice it is finite due to the finite size of the
graph. We then treat the vector of distances as a vector in the unit
cube of dimension equal to the maximum value of k. We compute
the Euclidean distance between two of these vectors, representing
two networks, in equation (4). Finally, we choose to switch from
"distance" to "agreement" in equation (5) simply because we feel
agreement is a more intuitive measure.

To gauge the quality of this agreement measure, we computed the 
average agreements between various model (i.e., theoretical) net- 
works. For example, when comparing networks of the same type 
(ER vs ER, ER-DD vs ER-DD, GEO-3D vs GEO-3D, or SF vs SF), 
we found the mean agreement to be 0.84 with a standard deviation of 
0.07. To verify that our "agreement" measure can give low values for 
networks that are very different, we also constructed a "straw-man" 
model graph called a circulant, and compared it to some actual PPI 
network data. A circulant graph is constructed by adding "chords"
to a cycle on n nodes (examples of cycles on 3, 4, and 5 nodes are
graphlets G2, G5, and G15, respectively) so that the i-th node on the cycle
is connected to the [(i + j) mod n]-th and [(i − j) mod n]-th nodes on
the cycle. Clearly, a large circulant with the same number of nodes
and edge density as the data would not be very representative of a PPI
network, and indeed we find that the agreement between such a circulant,
with chords defined by j ∈ {6, 12}, and the data is under
0.08. Note that in most of the fourteen PPI networks, the number of
edges is about 3 times the number of nodes, so we chose circulants
with three times as many edges as nodes; also, we chose j > 5 to
maximize the number of 3, 4, and 5-node graphlets that do not occur
in the circulant, since all of the 3, 4, and 5-node graphlets occur in
the data.
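
The circulant construction can be sketched as follows. This is an illustrative example under stated assumptions: the function name `circulant_edges` and the network size are hypothetical, and the offset 1 is included explicitly to represent the underlying cycle, alongside the chord offsets j ∈ {6, 12} mentioned in the text.

```python
def circulant_edges(n, offsets=(1, 6, 12)):
    """Edges of a circulant on nodes 0..n-1.

    Offset 1 gives the underlying cycle; every other offset j adds the
    chords joining node i to nodes (i + j) mod n and (i - j) mod n.
    """
    edges = set()
    for i in range(n):
        for j in offsets:
            # Adding (i, i + j) for every i also covers the (i - j) chords.
            edges.add(frozenset((i, (i + j) % n)))
    return edges


n = 1000  # hypothetical size, chosen only for illustration
edges = circulant_edges(n)
degree = [0] * n
for e in edges:
    u, v = tuple(e)
    degree[u] += 1
    degree[v] += 1
```

With these offsets every node has degree 6, so the circulant has three times as many edges as nodes, which matches the edge density chosen in the text.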



3 RESULTS AND DISCUSSION 

We present the results of applying the newly introduced "agreement" 
measure (section 2.3) to fourteen eukaryotic PPI networks and their 
corresponding model networks of four different network model types 
(described in section 2.1). The results show that 3-dimensional geo- 
metric random graphs have exceptionally high agreement with all of 
the fourteen PPI networks. 

We undertook a large-scale scientific computing task by imple- 
menting the above described new methods and using them to compare 
agreements across the four random graph models of fourteen real PPI 
networks. Using these new methods, we analyzed a total of 1,414
networks: fourteen eukaryotic PPI networks of varying confidence
levels described in section 2.1 and 25 model networks per random
graph model corresponding to each of the fourteen PPI networks,
where random graph models were ER, ER-DD, SF, and GEO-3D
(described in section 2.1). The largest of these networks had around
7,000 nodes and over 20,000 edges. For each of the fourteen PPI
networks and each of the four random graph models, we compu- 
ted averages and standard deviations of graphlet degree distribution 
(GDD) agreements between the PPI and the 25 corresponding model 
networks belonging to the same random graph model. The results are 
presented in Figure 3. 

Fig. 2. Examples of graphlet degree distributions (GDDs) for the yeast high-confidence PPI network (37) (solid red line) and the average of 25 corresponding
3-dimensional geometric random networks (GEO-3D, dashed green line): A. Orbit 11 (agreement 0.89). B. Orbit 16. Most counts beyond about k = 20 are zero, with a few
instances of 1 or (very occasionally) 2. This results in a large amount of red and green ink which is mostly noise, as the distribution fluctuates frequently from
1 to 0 (which is −∞ on our log scale). The noise could be reduced by applying a broad-band filter, but we have chosen to leave the data in its raw state, despite
the deleterious effect on the aesthetics of the plot.

Fig. 3. Agreements between the fourteen PPI networks and their corresponding model networks. Labels on the horizontal axes are described in section 2.1.
Averages of agreements between 25 model networks and the corresponding PPI network are presented for each random graph model and each PPI network,
i.e., at each point in the Figure; the error bar around a point is one standard deviation below and above the point (in some cases, error bars are barely visible,
since they are of the size of the point). As described in section 2.3, the agreement between a PPI and a model network is based on the: A. arithmetic average
of the j-th GDD agreements; B. geometric average of the j-th GDD agreements.

Erdos-Renyi random graphs (ER) show about 0.5 agreement with
each of the PPI networks while scale-free networks of type ER-DD
and SF show a slightly improved agreement (ER-DD networks are
random scale-free, since the degree distributions forced on them by
the corresponding PPI networks roughly follow a power law). Note
that GEO-3D networks show the highest agreement for all but one
of the fourteen PPI networks (Figure 3). For the HS PPI network, it is
not clear which of the GEO-3D, SF, and ER-DD models agrees the
most with the data, since the average agreements of the HS network with
these models are about the same and within one standard deviation
of each other. The GEO-3D and SF models are similarly tied for the
FE network. Since networks belonging to the same random graph
model have an average agreement of 0.84 with a standard deviation of
0.07 (shown in section 2.3), the agreements of over 0.7 that most
of the PPI networks have with the GEO-3D model are very good.
Note that eight out of the fourteen PPI networks have agreements
with the GEO-3D model of over 0.75; since networks of the same type
agree on average by 0.84 ± 0.07, we conclude that agreements of
0.75 are exceptionally high and are unlikely to be beaten by another
network model under this measure. Also, it is interesting that the GEO-3D
model shows high agreement with PPI networks obtained from
various experimental techniques (Y2H, TAP, HMS-PCI) as well as
from human curation (see section 2.1). Note that this does not mean
that GEO-3D is the best possible model. For example, it may be
possible to construct a different "agreement" measure that is more
sensitive and under which a model better than GEO-3D may be apparent.
However, we believe that the current "agreement" measure is
sensitive and meaningful enough to conclude that GEO-3D is a better
model than ER, ER-DD, and SF.

4 CONCLUSION

We have constructed a new measure of structural similarity between 
large networks based on the graphlet degree distribution. The degree 
distribution is the first one in the sequence of graphlet degree dis- 
tributions that are constructed in a structured and systematic way to 
impose a large number of constraints on the structure of networks 
being compared. This new measure is easily extendible to a greater 
number of constraints simply by adding more graphlets to those in 
Figure 1, although this would add significantly to the cost of com- 
puting agreements; the extensions are limited only by the available 
CPU. Based on this new network similarity measure, we have shown 
that almost all of the fourteen eukaryotic PPI networks resulting from 
various high-throughput experimental techniques, as well as cura- 
ted databases, are better modeled by geometric random graphs than 
by Erdos-Renyi, random scale-free, or Barabasi-Albert scale-free 
networks. 